In [19]:
import pandas as pd 
import numpy as np
import os 
import matplotlib.pyplot as plt
import plotly.express as px
from collections import defaultdict
import json
from collections import Counter
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
from IPython.display import display, HTML
#%matplotlib inline  
In [22]:
## Access to all the data 
display(HTML("""<a href="https://github.com/tingtingbun/interactiveds/tree/master/interactive">datafile</a>"""))

PART I: Data Loading

Data Sources: There are a total of three data files from separate sources

a. file one: a directory of separate per-state files containing baby names from Social Security card applications for all states (including the District of Columbia) from 1910 to 2018

b. file two: a single Excel sheet containing the median income for all states (including the District of Columbia) from 1984 to 2018

c. file three: a JSON file that records the corresponding state codes for all states

Data Description

a. file one: five fields ['State','Sex','Year of birth','Name','Number of Occurrences']

b. file two: one 'State' field plus one median-income column per year (melted into ['State','year','median_income'] during loading)

c. file three: sample format {"state code 1":"state 1"}

Data Integration

a. Based on file three, we replace each full state name with its state code to prepare for the table join between file one and file two

b. Join file one and file two on the two fields ['Year','State']

c. Since there are no median-income data points for years prior to 1984, we restrict the final dataset to 1984 to 2018

d. Other remarks: there are no missing values in the dataset
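
The integration steps above can be sketched on toy data (the state names, incomes, and counts here are invented purely for illustration; the real files are loaded in the cell below):

```python
import pandas as pd

# Toy income table in wide format (years as columns), as in file two.
income_wide = pd.DataFrame({
    'State': ['California', 'Texas'],
    2017: [67000, 57000],
    2018: [70000, 59000],
})

# Melt to long format: one row per (State, year).
income_long = pd.melt(income_wide, id_vars='State', value_vars=[2017, 2018],
                      var_name='year', value_name='median_income')

# Step a: replace full state names with state codes (file three, inverted to name -> code).
us_code = {'California': 'CA', 'Texas': 'TX'}
income_long['State'] = income_long['State'].map(us_code)

# Step b: join a toy names table (file one) on ['State', 'year'].
names = pd.DataFrame({
    'State': ['CA', 'TX'],
    'Year_of_birth': [2018, 2018],
    'Name': ['Emma', 'Olivia'],
    'Num_of_occurrences': [500, 400],
})
merged = pd.merge(income_long, names,
                  left_on=['State', 'year'], right_on=['State', 'Year_of_birth'])

# Step c falls out automatically: years with no income data drop out of the inner join.
print(merged)
```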

In [4]:
### dataset loading 
def data_loading(name_file, state_file, us_abbre_file):
    ### loading the per-state name files from the given directory
    df = pd.concat([pd.read_csv(name_file + '/' + i, sep=",", header=None)
                    for i in os.listdir(name_file) if not i.startswith('.')],
                   ignore_index=True)
    df.columns = ['State', 'Sex', 'Year_of_birth', 'Name', 'Num_of_occurrences']
    print('There are {} states in the dataset and they are {}'.format(
        len(df.State.unique().tolist()), df.State.unique().tolist()))
    ### To check all 50 states + D.C. are loaded correctly
    
    ### loading income dataset and melting it from wide to long format
    df_income = pd.read_excel(state_file, header=[0])
    df_income = pd.melt(df_income, id_vars='State', value_vars=df_income.columns.tolist()[1:])
    df_income.rename(columns={"variable": "year", "value": "median_income"}, inplace=True)
    
    ### loading state abbreviation json and inverting it to {state name: state code}
    with open(us_abbre_file) as file:
        us_code = {name: code for code, name in json.load(file).items()}
        
    ### replacing full state names with their codes (skip the national aggregate row)
    mask = df_income['State'] != 'United States'
    df_income.loc[mask, 'State'] = df_income.loc[mask, 'State'].map(us_code)
    
    return pd.merge(df_income, df, left_on=['State', 'year'], right_on=['State', 'Year_of_birth'])
In [7]:
a = data_loading('namesbystate','h08-2.xls','state_code.json')
There are 51 states in the dataset and they are ['IN', 'IL', 'KS', 'SC', 'HI', 'GA', 'SD', 'CO', 'NH', 'MS', 'MD', 'UT', 'LA', 'ME', 'WI', 'NJ', 'AR', 'NY', 'MT', 'OK', 'MA', 'NM', 'WY', 'OH', 'OR', 'NV', 'TX', 'TN', 'AZ', 'MN', 'WA', 'WV', 'NC', 'MO', 'AL', 'VA', 'CA', 'CT', 'AK', 'ND', 'VT', 'MI', 'NE', 'KY', 'ID', 'DC', 'IA', 'FL', 'PA', 'RI', 'DE']
In [0]:
print('The first 5 rows of the data \n\n' + a.head().to_string() + '\n\n')

print('Basic description of different fields in the integrated dataset \n\n' + a.describe().to_string())
The first 5 rows of the data 

  State  year  median_income Sex  Year_of_birth    Name  Num_of_occurrences
0    AL  2018          49936   F           2018     Ava                 293
1    AL  2018          49936   F           2018  Olivia                 259
2    AL  2018          49936   F           2018  Harper                 204
3    AL  2018          49936   F           2018    Emma                 202
4    AL  2018          49936   F           2018  Amelia                 183


Basic description of different fields in the integrated dataset 

       median_income  Year_of_birth  Num_of_occurrences
count   2.879400e+06   2.879400e+06        2.879400e+06
mean    4.367943e+04   2.002096e+03        4.061643e+01
std     1.284236e+04   1.017420e+01        1.278418e+02
min     1.543000e+04   1.983000e+03        5.000000e+00
25%     3.366400e+04   1.994000e+03        7.000000e+00
50%     4.346400e+04   2.003000e+03        1.100000e+01
75%     5.201900e+04   2.011000e+03        2.700000e+01
max     8.634500e+04   2.018000e+03        8.243000e+03

PART II: Research Question

Research question 1: Which names are boys as likely to be given as girls? Can we detect gender-neutral names, and how has the number (or percentage) of gender-neutral names changed over time?

Research question 2: Do baby names reflect the income level of the state over time? Are baby names reflective of any income trends?

Research question 3: Are there trendsetting states in baby names?


PART III: Approach to the Research Questions with EDA

Question 1: Gender Neutral Name

The detection will be based on two derived fields and an algorithm designed to calculate a gender neutral score

1) Derived fields

a. Absolute Number Difference $\;\;\;$ b. Percentage Difference after Normalization

2) Algorithm

a. Criteria
I. high total name occurrences (impact) $\;\;\;$ II. low percentage difference (neutrality)
b. Rationale
- After looking at the percentage difference and total count of each name, I realized that there is a trade-off between a name being impactful and being gender-neutral. To accommodate the trade-off, the algorithm takes two factors into the final score calculation: the first is the percentage difference between a name's use for males and females, and the other is the total occurrences of the name.
- Since the objective of the algorithm is to find gender-neutral names, the weighting coefficients for percentage difference and total count are 1 and 0.75, respectively. (We could adjust the coefficients and test different ratios later.)
Further notes:
a) we exclude names with a percent difference greater than 50%, as it indicates a strong gender inclination
b) 'Unknown' is also excluded from the final name lists
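
As a sanity check, the scoring scheme can be traced on a three-name toy example (the F/M counts below are invented; the real calculation runs on the full dataset in the next cell):

```python
import pandas as pd

# Invented F/M occurrence totals for three names.
bi = pd.DataFrame({'F': [500, 40, 5000], 'M': [480, 60, 9000]},
                  index=['Quinn', 'Arden', 'Riley'])
bi['percent_dff'] = abs(bi.F - bi.M) / (bi.F + bi.M)
bi['Total_num'] = bi.F + bi.M

# Exclude strongly gender-inclined names (percent difference > 50%).
bi = bi[bi['percent_dff'] <= 0.5].copy()

# Rank so that lower percent difference and higher total both earn higher rank scores.
bi['percent_diff_rank'] = bi['percent_dff'].rank(ascending=False)
bi['Total_num_rank'] = bi['Total_num'].rank()

# Weighted combination: neutrality weighted 1, impact weighted 0.75.
bi['score'] = bi['percent_diff_rank'] + 0.75 * bi['Total_num_rank']
print(bi.sort_values('score', ascending=False))
# 'Quinn' tops the list: nearly balanced (about a 2% difference) with a solid total,
# beating the large-but-lopsided 'Riley' and the balanced-but-tiny 'Arden'.
```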
In [8]:
def gender_neutral_visualize (df): 
    ### visualization of name lists with equal distribution between genders
    ### and understanding the trade-off between percent difference and total number of occurrences
    
    gender_df = df.groupby(['Name','Sex']).agg({'Num_of_occurrences': 'sum'}).unstack().fillna(0)
    bi_gender = gender_df.Num_of_occurrences[(gender_df.Num_of_occurrences.F > 0) & (gender_df.Num_of_occurrences.M > 0)]
    
    ### Deriving the two fields 
    bi_gender['diff'] = abs(bi_gender.F - bi_gender.M)
    bi_gender['percent_dff'] = abs(bi_gender.F - bi_gender.M )/(bi_gender.M+bi_gender.F)
    bi_gender['Total_num'] = bi_gender.M+bi_gender.F 
    absolute_equal = bi_gender[(bi_gender['percent_dff'] == 0)] ### -> this gives a dataframe with names that are equally 
    absolute_equal.reset_index(inplace = True)                 ###    shared between men and women 
    
    ### Plotting equally shared names against each name's total occurrences
    f, (ax1, ax2) = plt.subplots(1, 2, figsize=(25,10))
    x = absolute_equal.Name.tolist()
    y_pos = np.arange(len(x))
    y = absolute_equal['Total_num'].tolist()

    ax1.bar(y_pos, y, align='center', alpha=0.6)
    ax1.set_xticks(y_pos)
    ax1.set_xticklabels(x, rotation=90)
    ax1.set_ylabel('Num of Occurrences')
    ax1.set_title('Absolute Gender Neutral Names')
    
    max_state = absolute_equal[absolute_equal.Total_num==absolute_equal.Total_num.max()].Name.values[0]
    print(max_state)
    arrow_x = absolute_equal[absolute_equal.Total_num==absolute_equal.Total_num.max()].index.values[0] + 0.5
    arrow_y = absolute_equal.Total_num.max() -1
    ax1.annotate('max name is {} and the total count is  {}'.format(max_state,absolute_equal.Total_num.max()) \
             ,  xy=(arrow_x, arrow_y), xytext=(arrow_x-20, arrow_y), arrowprops=dict(facecolor="black", width=4,headwidth=10, shrink=0.2))
   
    
    
    ### Plotting for percent difference against total number of occurrences 
    ax2.scatter(x=bi_gender['percent_dff'],y=bi_gender['Total_num'])
    ax2.set_xlabel('Normalized Difference')
    ax2.set_ylabel('Num of occurrences')
    ax2.set_title('Occurrences and Percent Diff Trade-off')
    

Outcome 1: The list of names that are equally distributed between genders and a trade-off between percent_diff and the total number of occurrences

In [0]:
gender_neutral_visualize(a)  
Rei
In [9]:
def algorithm_calculation (df, location='All'):
    gender_df = df.groupby(['Name','Sex']).agg({'Num_of_occurrences': 'sum'}).unstack().fillna(0)
    bi_gender = gender_df.Num_of_occurrences[(gender_df.Num_of_occurrences.F > 0) & (gender_df.Num_of_occurrences.M > 0)]
    
   
    
    bi_gender['percent_dff'] = abs(bi_gender.F - bi_gender.M )/(bi_gender.M+bi_gender.F)
    bi_gender['Total_num'] = bi_gender.M+bi_gender.F  ### replicating the aggregated and derived dataframe
    bi_gender_ii = bi_gender[(bi_gender['percent_dff']<=0.5) & (bi_gender.index!= 'Unknown')].copy()
   
    bi_gender_ii['percent_diff_rank'] = bi_gender_ii['percent_dff'].rank(ascending=False) ### rank percent_dff: the lower the difference, the higher the rank score
    bi_gender_ii['Total_num_rank'] = bi_gender_ii['Total_num'].rank() ### ranking the total num, the higher the higher rank score 
    bi_gender_ii['score'] = bi_gender_ii['percent_diff_rank']+0.75*bi_gender_ii['Total_num_rank'] ### combined score calculate by the weighting coefficient
    bi_gender_ii.sort_values(by='score',ascending = False, inplace = True)
    print(bi_gender_ii.head(10).to_string())
    print('\n the top {} suggested highly gender neutral names in {} are \n {}'.format(len(bi_gender_ii[0:100].index.tolist()),location, bi_gender_ii[0:100].index.tolist()))

Outcome 2 (a): Calculated scores based on the defined algorithm and the top 100 names for all states

In [0]:
algorithm_calculation(a) ### this show name list for all states
Sex            F        M  percent_dff  Total_num  percent_diff_rank  Total_num_rank   score
Name                                                                                        
Kerry    45620.0  46346.0     0.007894    91966.0              494.0           522.0  885.50
Quinn    29791.0  28742.0     0.017922    58533.0              485.0           511.0  868.25
Kris     11020.0  10874.0     0.006668    21894.0              496.0           494.0  866.50
Justice  15229.0  15618.0     0.012611    30847.0              490.0           501.0  865.75
Lennon    3367.0   3339.0     0.004175     6706.0              497.0           466.0  846.50
Santana   3179.0   3173.0     0.000945     6352.0              499.0           463.0  846.25
Landry    3013.0   3019.0     0.000995     6032.0              498.0           461.0  843.75
Emerson  20175.0  18459.0     0.044417    38634.0              459.0           507.0  839.25
Arden     3211.0   3313.0     0.015635     6524.0              486.0           465.0  834.75
Robbie   18642.0  16988.0     0.046422    35630.0              456.0           505.0  834.75

 the top 100 suggested highly gender neutral names in All are 
 ['Kerry', 'Quinn', 'Kris', 'Justice', 'Lennon', 'Santana', 'Landry', 'Emerson', 'Arden', 'Robbie', 'Jackie', 'Oakley', 'Riley', 'Infant', 'Baby', 'Tristyn', 'Kodi', 'Arlyn', 'Yael', 'Jessie', 'Elisha', 'Amari', 'Maxie', 'Kimani', 'Ivory', 'Nieves', 'Parris', 'Stevie', 'Blair', 'Michal', 'Gentry', 'Burnice', 'Armani', 'Frankie', 'Waverly', 'Carey', 'Natividad', 'Jael', 'Notnamed', 'Jaime', 'Harley', 'Charleston', 'Jaylin', 'Ashten', 'Casey', 'Devyn', 'Daylin', 'Claudie', 'Trinidad', 'Lorenza', 'Ashtin', 'Christan', 'Milan', 'Laramie', 'Devine', 'Peyton', 'Arin', 'Jaedyn', 'Ridley', 'Ocean', 'Ozell', 'Britt', 'Ryley', 'Harpreet', 'Tenzin', 'Jireh', 'Rumi', 'Finley', 'Rei', 'Alva', 'Garnett', 'Pat', 'Hoa', 'Reilly', 'Amrit', 'Rooney', 'Iran', 'Chong', 'Unnamed', 'Remy', 'Briar', 'Skyler', 'Amandeep', 'Teegan', 'Loghan', 'Amen', 'Divine', 'Rowan', 'Mandeep', 'Joell', 'Kalin', 'Charly', 'Cypress', 'Audie', 'Indiana', 'Aziah', 'Shea', 'Jazz', 'Sutton', 'Phoenix']

Outcome 2 (b): Calculated scores based on the defined algorithm and the top 100 names for each state

In [0]:
### this show name list for each state respectively
#for i in a.State.unique().tolist():
    #print('\n')
    #print(i)
    #algorithm_calculation(a[a.State == i],location = i)

Question 2: Income vs Name

Two layers of analysis:

1) Correlation between a state's volume of unique names and its median income

a. Regression plot between the two variables $\;\;\;$ b. t test on the variables (recognizing there may be omitted-variable bias)

2) Statistical test between the volume of names a state was the first to use and its median income
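
Layer 2) is not coded up below, so here is a minimal sketch of how it could run (the toy slice, the `first_names`/`income` column names, and the choice of Pearson's r as the statistic are all my assumptions, not part of the notebook):

```python
import pandas as pd
from scipy import stats

# Toy slice of the merged dataset; states, names, and incomes are invented.
df = pd.DataFrame({
    'State': ['CA', 'CA', 'TX', 'TX', 'VT', 'VT'],
    'Year_of_birth': [1990, 1992, 1991, 1990, 1995, 1996],
    'Name': ['Aiden', 'Brynn', 'Aiden', 'Cason', 'Dex', 'Brynn'],
    'median_income': [60000, 61000, 52000, 51000, 48000, 49000],
})

# A name's first use is its minimum Year_of_birth; joining back recovers
# which state(s) recorded that first use.
first_year = df.groupby('Name')['Year_of_birth'].min().reset_index()
first_use = pd.merge(first_year, df, on=['Name', 'Year_of_birth'])

# Per-state volume of first-used names vs. mean median income.
per_state = first_use.groupby('State').agg(first_names=('Name', 'count'))
income = df.groupby('State').agg(income=('median_income', 'mean'))
joined = per_state.join(income)

# Pearson correlation with its two-sided p value.
r, p = stats.pearsonr(joined['first_names'], joined['income'])
print(joined)
print('r = {:.3f}, p = {:.3f}'.format(r, p))
```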

In [10]:
def regression_plot_income_name(df, year=False):
    ### group by state only, or by state and the given year column name
    if year is False:
        element = ['State']
    else:
        element = ['State', year]
    
    agg_df = df.groupby(element).agg({'Name': 'count', 'median_income': 'mean'})
    agg_df.rename(columns={'Name': 'count of name'}, inplace=True)
    agg_df['state'] = agg_df.index
    return sns.lmplot(y='median_income', x='count of name', data=agg_df)

a. Regression plot between the two variables

1. level one (aggregated across years for each state) -> the correlation between median income and count of names is not so obvious
2. level two (not aggregated across years) -> we see a slightly stronger correlation between median income and count of names
In [0]:
regression_plot_income_name(a) 
regression_plot_income_name(a, 'year') 
Out[0]:
<seaborn.axisgrid.FacetGrid at 0x1362e4080>

b. t test on the variables (recognizing there may be omitted-variable bias) -> the p value is low; the coefficient is statistically significant

In [17]:
agg_df = a.groupby(['State', 'year']).agg({'Name':'count','median_income':'mean'})
agg_df.rename(columns = {'Name':'count of name'}, inplace = True)
agg_df['state'] = agg_df.index
In [18]:
y = agg_df.median_income.values.tolist()
x = agg_df['count of name']
model = sm.OLS(x, y)  ### endog = count of unique names, exog = median income (no intercept term)
results = model.fit()
slope = results.params[0]
p_value = stats.t.sf(results.tvalues[0], results.df_resid)  ### one-sided p value

print('The regression of total count of unique names on state median income \n t value : {} and p value of {}'.format(results.tvalues[0],p_value))
The regression of total count of unique names on state median income 
 t value : 53.47311314363534 and p value of 0.0

Question 3: Trendsetter States (Two-Approach Verification Method)

Approach 1: State-Based

a. Time series plot of the number of unique names each state has
b. Bar chart of the volume of names each state was the first to adopt

Approach 2: Name-Based

a. Find out top 500 popular names based on occurrences
b. Bar chart of the percentage of occurrences over each state's total occurrences

for the explanation on why I chose 500, please see below*

Rationale

Limitation

a. raw name volume over time does not fully reflect the "setting" aspect, since states with larger populations naturally use more names than others.

In [11]:
### Approach 1 a. time series plot of the number of unique names each state has
def name_volumn_visualize (df):
    new_df = df.groupby(['Year_of_birth', 'State']).agg({'Name': 'count'})
    new_df['year'] = new_df.index.get_level_values('Year_of_birth')
    new_df['volume'] = new_df['Name']
    new_df['state'] = new_df.index.get_level_values('State')
    new_df = new_df.droplevel(['Year_of_birth'])
    fig = px.line(new_df, x='year', y='volume', color='state')
    fig.show(renderer="notebook")
In [12]:
name_volumn_visualize (a) ### You can hover over any line to see what state it corresponds to 
In [13]:
### Approach 1 b. bar chart of the volume of names each state was the first to adopt
def first_setter (df):
    new = df.groupby('Name').agg({'Year_of_birth': 'min'})
    new['Name_min'] = new.index
    ### join back to find which state(s) recorded each name's first year of use
    new_df = pd.merge(left=new, right=df, how='inner',
                      left_on=['Name_min', 'Year_of_birth'], right_on=['Name', 'Year_of_birth'])
    
    ### plotting 
    new_df = new_df.groupby('State').agg({'Name_min': 'count'})
    new_df['State'] = new_df.index
    fig = px.bar(x=new_df['State'], y=new_df['Name_min'])
    fig.show(renderer='notebook')
In [14]:
first_setter(a)

*Reason behind choosing 500 names

- I performed a grid search over the top 100, 200, 500, 800, 1000, and 2000 names to identify the best number to capture overall popularity (share of the total name occurrences)
- Looking at the percentage change for each increase, I noticed that it starts to shrink after 500, so I chose 500 as my popularity threshold
In [15]:
### Rationale for choosing 500 
name_df = a.groupby('Name').agg({'Num_of_occurrences':'sum'}) ### total occurrences per name
name_df.sort_values(by='Num_of_occurrences', ascending=False, inplace=True)
for i in [100, 200, 500, 800, 1000, 2000]:
    capture = name_df[0:i].Num_of_occurrences.sum() / name_df.Num_of_occurrences.sum() * 100
    print('the top {} names capture {}% of the total names'.format(i, capture))


digit1 = name_df[0:500].Num_of_occurrences.sum()
total = name_df.Num_of_occurrences.sum()
fig1, ax1 = plt.subplots()
ax1.pie([digit1, total-digit1], labels=['500 Names', 'Remaining Names'], autopct='%1.1f%%', shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
the top 100 names capture 37.87013373188906% of the total names
the top 200 names capture 52.30085241128334% of the total names
the top 500 names capture 71.69071639518859% of the total names
the top 800 names capture 80.4833093302483% of the total names
the top 1000 names capture 84.1873727922783% of the total names
the top 2000 names capture 92.72148640037854% of the total names
In [16]:
### Approach 2: a. list of top 500 names 
###             b. the percentage of babies given these 500 names in each state  
def popular_coverage(df):
    name_df = df.groupby('Name').agg({'Num_of_occurrences':'sum'}) ### total occurrences per name
    name_df.sort_values(by='Num_of_occurrences', ascending=False, inplace=True)
    top_500 = name_df[0:500].index.tolist()
    
    df_top_500 = df[df.Name.isin(top_500)]
    count_df = df_top_500.groupby('State').agg({'Num_of_occurrences': 'sum'})
    count_df.rename(columns={'Num_of_occurrences': 'top_500_occurrences'}, inplace=True)
    total_occur = df.groupby('State').agg({'Num_of_occurrences': 'sum'})
    count_df = pd.merge(count_df, total_occur, left_index=True, right_index=True)
    count_df['Name_percent'] = count_df['top_500_occurrences'] / count_df['Num_of_occurrences']
    count_df.reset_index(inplace=True)
    ### plotting the bar chart of the share of each state's occurrences covered by the top 500 names
    
    plt.figure(figsize=(15,5))
    
    x = count_df.State.tolist()
    y_pos = np.arange(len(x))
    y = count_df['Name_percent'].tolist()

    plt.bar(y_pos, y, align='center', alpha=0.6)
    plt.xticks(y_pos, x, rotation=90)
    plt.ylabel('percentage of total occurrences')
    plt.title('Percentage of Babies with the Top 500 Names across States', fontsize=20) 
    
    max_state = count_df[count_df['Name_percent']==count_df['Name_percent'].max()].State.values[0]
    arrow_x = count_df[count_df['Name_percent']==count_df['Name_percent'].max()].index.values[0] + 0.5
    arrow_y = count_df['Name_percent'].max() 
    plt.annotate('max state is {} and its share is {}'.format(max_state, arrow_y),
                 xy=(arrow_x, arrow_y), xytext=(arrow_x+3, arrow_y),
                 arrowprops=dict(facecolor="black", width=3, headwidth=10, shrink=0.2))
    
In [0]:
popular_coverage(a)

Outcome

Approach 1:

a) In early years, Texas used the largest number of unique names; around 1955, New York and California surpassed it and have held the largest numbers of unique names since

b) California has the most instances of being the first state to use a name, followed by Texas and New York

Approach 2:

Conclusion:
